Using Sketches to Estimate Associations

Authors

  • Ping Li
  • Kenneth Ward Church
Abstract

We should not have to look at the entire corpus (e.g., the Web) to know if two words are associated or not. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (document frequencies) and the size of the collection. Unsurprisingly, computational work and statistical accuracy (variance or errors) depend on the sampling rate, as will be shown both theoretically and empirically. Sampling methods become more and more important with larger and larger collections. At Web scale, sampling rates as low as 10^-4 may suffice.
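
As a rough illustration of the idea, the following Python sketch estimates how many documents contain both of two words from short sketches of their posting lists. It is a simplified baseline that scales the sampled co-occurrence count up to the full collection; it is not the maximum likelihood estimator described in the abstract, which additionally uses the margins. The function names, the 1-based permuted document IDs, and the synthetic data are all assumptions made for this example.

```python
import random


def build_sketch(postings, perm, k):
    """Return the k smallest permuted document IDs of a posting list.

    postings: iterable of original document IDs (0-based).
    perm: list mapping original ID -> permuted ID in 1..D (assumed layout).
    """
    return sorted(perm[d] for d in postings)[:k]


def estimate_cooccurrence(sketch1, sketch2, num_docs):
    """Estimate the number of documents containing both words.

    Both sketches must be non-empty and sorted in ascending order.
    Documents with permuted ID <= d_s are fully covered by both sketches,
    so they behave like a random sample of d_s documents out of num_docs.
    """
    d_s = min(sketch1[-1], sketch2[-1])           # effective sample size
    in_sample1 = {i for i in sketch1 if i <= d_s}
    in_sample2 = {i for i in sketch2 if i <= d_s}
    a_s = len(in_sample1 & in_sample2)            # co-occurrences in sample
    return a_s * num_docs / d_s                   # plain scaling, no MLE


if __name__ == "__main__":
    rng = random.Random(0)
    num_docs = 100_000
    perm = list(range(1, num_docs + 1))
    rng.shuffle(perm)                             # random permutation of IDs

    # Synthetic posting lists (hypothetical data, for illustration only).
    word1 = set(rng.sample(range(num_docs), 20_000))
    word2 = set(rng.sample(range(num_docs), 10_000)) | set(list(word1)[:5_000])

    s1 = build_sketch(word1, perm, k=2_000)
    s2 = build_sketch(word2, perm, k=2_000)
    print("true co-occurrences:", len(word1 & word2))
    print("estimated:", estimate_cooccurrence(s1, s2, num_docs))
```

Only the two sketches and the collection size are needed at query time, which is where the savings over scanning the full posting lists come from.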

Similar resources

Using Sketches to Estimate Two-way and Multi-way Associations

We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are associated or not. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the mar...

Using Hedonic Prices to Estimate Quality Changes concerning Iranian Automobile Market

Abstract This paper sketches a model of product differentiation according to the hedonic hypothesis that is based on the theory of consumer behavior of Lancaster (1971). Lancaster suggested that utility is derived from the characteristics of the good and not the good itself. Thus, from the perception of the consumer, every characteristic has a price. This is the hedonic (or implicit) price. We ...

Thinking with Sketches

Sketches serve to externalize ideas, to render fleeting ideas permanent, to confer coherence on scattered concepts, to turn internal thoughts public. They can be created and recreated, examined and reexamined, configured and reconfigured, considered and reconsidered, for clarity and for creativity. The schematic vocabulary of sketches allows both expression and discovery of ideas. Sketching is ...

A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations

We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are strongly associated or not. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population using straightforward scaling. However, one can do ...

New cardinality estimation algorithms for HyperLogLog sketches

This paper presents new methods to estimate the cardinalities of multisets recorded by HyperLogLog sketches. A theoretically motivated extension to the original estimator is presented that eliminates the bias for small and large cardinalities. Based on the maximum likelihood principle a second unbiased method is derived together with a robust and efficient numerical algorithm to calculate the e...

Publication date: 2005